Skip to content

security: avoid shell fallback in SLURM requeue#21751

Open
dfgvaetyj3456356-hash wants to merge 1 commit into
Lightning-AI:masterfrom
dfgvaetyj3456356-hash:security/slurm-requeue-no-shell-fallback
Open

security: avoid shell fallback in SLURM requeue#21751
dfgvaetyj3456356-hash wants to merge 1 commit into
Lightning-AI:masterfrom
dfgvaetyj3456356-hash:security/slurm-requeue-no-shell-fallback

Conversation

@dfgvaetyj3456356-hash
Copy link
Copy Markdown

What does this PR do?

This hardens SLURM auto-requeue handling by keeping the requeue call on the argv-based subprocess path and replacing the previous assert validation with an explicit runtime check.

The previous fallback retried scontrol requeue <job_id> through shell=True when scontrol was not found. Since the job id comes from the environment, this PR avoids ever turning that value into a shell command string. The validation also now uses re.fullmatch(...) and raises a normal exception so it is not removed when Python runs with optimizations.

Test Plan

PYTHONPATH=src python -m pytest tests/tests_pytorch/trainer/connectors/test_signal_connector.py -q
python -m py_compile src/lightning/pytorch/trainer/connectors/signal_connector.py tests/tests_pytorch/trainer/connectors/test_signal_connector.py
git diff --check

On Windows, the pytest file reports the SLURM-specific tests as skipped by the existing RunIf(skip_windows=True) marker, so I also ran direct handler proofs outside the marker, including under python -O, and confirmed invalid job ids are rejected before subprocess.call is invoked.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Jun 1, 2026

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79%. Comparing base (932b7e3) to head (2a66582).
✅ All tests successful. No failed tests found.
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.

❗ There is a different number of reports uploaded between BASE (932b7e3) and HEAD (2a66582). Click for more details.

HEAD has 2032 uploads less than BASE
Flag BASE (932b7e3) HEAD (2a66582)
cpu 497 41
python 36 3
lightning_fabric 160 0
pytest 248 0
python3.12 143 12
python3.13 106 8
lightning 176 15
python3.11 70 6
python3.12.7 107 9
python3.10 35 3
pytorch2.5.1 18 3
pytest-full 249 41
pytorch2.8 36 6
pytorch_lightning 161 26
pytorch2.9 36 5
pytorch2.10 36 6
pytorch2.1 35 6
pytorch2.2.2 17 3
pytorch2.3 18 3
pytorch2.7 17 3
pytorch2.6 18 3
pytorch2.4.1 18 3
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #21751     +/-   ##
=========================================
- Coverage      87%      79%     -8%     
=========================================
  Files         270      267      -3     
  Lines       23973    23912     -61     
=========================================
- Hits        20748    18804   -1944     
- Misses       3225     5108   +1883     

@dfgvaetyj3456356-hash
Copy link
Copy Markdown
Author

Thanks for the CI run. It looks like the remaining red status is from a cancelled Ubuntu pl-cpu matrix job plus the pl-cpu guardian, while the other checks and Codecov report are green. I do not see a code failure from the changed SLURM requeue path. Could a maintainer rerun the cancelled job/guardian when convenient? Happy to adjust if a real failure appears.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants